Communication-Minimal Partitioning of Parallel Loops and Data Arrays for Cache-Coherent Distributed-Memory Multiprocessors
Authors
Abstract
Harnessing the full performance potential of cache-coherent distributed shared memory multiprocessors without inordinate user effort requires a compilation technology that can automatically manage multiple levels of the memory hierarchy. This paper describes a working compiler for such machines that automatically partitions loops and data arrays to optimize locality of access. The compiler implements a solution to the problem of finding communication-minimal partitions of loops and data. Loop and data partitions specify the distribution of loop iterations and array data across processors. A good loop partition maximizes the cache hit rate, while a good data partition minimizes remote cache misses. The problems of finding loop and data partitions interact when multiple loops access arrays with differing reference patterns. Our algorithm handles programs with multiple nested parallel loops accessing many arrays, with array access indices being general affine functions of the loop variables. It discovers communication-minimal partitions when communication-free partitions do not exist. The compiler also uses sub-blocking to handle finite cache sizes. We present a cost model that estimates the cost of a loop and data partition given machine parameters such as cache, local, and remote access timings. Minimizing the cost as estimated by our model is an NP-complete problem, as is the fully general problem of partitioning. We present a heuristic method that provides good approximate solutions in polynomial time. The loop and data partitioning algorithm has been implemented in the compiler for the MIT Alewife machine. The paper presents results obtained from a working compiler on a 16-processor machine for three real applications: Tomcatv, Erlebacher, and Conduct. Our results demonstrate that combined optimization of loops and data can improve runtime by nearly a factor of two over optimization of loops alone.
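To make the problem concrete, the sketch below shows the kind of loop nest such an analysis targets: a nested loop whose array subscripts are affine functions of the loop variables, with the outer loop block-partitioned across processors. A cost model of the sort described above would charge each candidate partition roughly hits·t_cache + local_misses·t_local + remote_misses·t_remote; this is only a plausible shape for such a model, not the paper's exact formulation. All names here (N, NPROC, compute_band) are assumptions for illustration, not the Alewife compiler's interface.

```c
/* Illustrative sketch (assumed names, not the paper's code): a loop
 * nest with affine subscripts, block-partitioned over the outer loop. */
#include <stdio.h>

#define N     512   /* problem size (assumed) */
#define NPROC 16    /* processor count, matching the paper's machine size */

static double A[N][N], B[N][N];

/* Subscripts i-1 and j-1 are affine functions of (i, j): the class of
 * references the partitioning analysis handles. */
static void compute_band(int lo, int hi)
{
    for (int i = lo; i < hi; i++)
        for (int j = 1; j < N; j++)
            A[i][j] = B[i - 1][j] + B[i][j - 1];
}

int main(void)
{
    /* Block loop partition: processor p owns outer-loop iterations
     * [p*N/NPROC, (p+1)*N/NPROC). If the data partition co-locates each
     * band's rows of A and B with the processor that writes them, only
     * the first row of each band reads B[i-1][*] from the neighboring
     * band: a communication-minimal, not communication-free, partition. */
    for (int p = 0; p < NPROC; p++) {
        int lo = p * (N / NPROC);
        int hi = (p + 1) * (N / NPROC);
        if (lo == 0)
            lo = 1;                /* keep i-1 in bounds */
        compute_band(lo, hi);      /* sequential stand-in for running
                                      band p on processor p */
    }
    printf("A[%d][%d] = %f\n", N - 1, N - 1, A[N - 1][N - 1]);
    return 0;
}
```

Deciding which loops to partition, and how to place the arrays, when many loops touch the same arrays with different affine reference patterns is exactly the interaction the paper's heuristic resolves.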
Similar References
Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors
This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expres...
Executing Nested Parallel Loops on Shared-Memory Multiprocessors
Cache-coherent, bus-based shared-memory multiprocessors are a cost-effective platform for parallel processing. In scientific parallel applications, most of the computation involves processing of large multidimensional data structures, which results in a high degree of data parallelism. This parallelism can be exploited in the form of nested parallel loops. Most existing shared memory multiprocesso...
Integrating Fine-Grained Message Passing in Cache Coherent Shared Memory Multiprocessors
This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache-coherent, shared-memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software-pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronizatio...
Partitioning Regular Applications for Cache-coherent Multiprocessors
In all massively parallel systems (MPPs), whether message-passing or shared-address space, the memory is physically distributed for scalability and the latency of accessing remote data is orders of magnitude higher than the processor cycle time. Therefore, the programmer/compiler must not only identify parallelism but also specify the distribution of data among the processor memories in order t...